{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ArticutAPI (文截斷詞）\n",
    "## 依語法結構計算，而非統計方法的中文斷詞。\n",
    "### [Articut API Website](https://api.droidtown.co/)\n",
    "### [Document](https://api.droidtown.co/document/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#載入 ArticutAPI Module\n",
    "from ArticutAPI import Articut, LawsToolkit\n",
    "from pprint import pprint\n",
    "articut = Articut()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 請自行修改 inputSTR 字串內容\n",
    "參數 level :斷詞的深度。數字愈小，切得愈細 (預設: lv2)。  \n",
    "lv1: 極致斷詞，適合 NLU 或機器自動翻譯使用。呈現結果將句子中的每個元素都儘量細分出來。  \n",
    "lv2: 詞組斷詞，適合文本分析、特徵值計算、關鍵字擷取…等應用。呈現結果將以具意義的最小單位呈現。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputSTR = \"我想過過過兒過過的日子。\"\n",
    "result = articut.parse(inputSTR, level=\"lv2\")\n",
    "\n",
    "print(\"## Status:\", result[\"status\"], \"\\n\")\n",
    "print(\"## Msg:\", result[\"msg\"], \"\\n\")\n",
    "if result[\"status\"]:\n",
    "    print(\"## Segmentation:\", result[\"result_segmentation\"], \"\\n\")\n",
    "    print(\"## POS:\", result[\"result_pos\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "### 列出斷詞結果所有詞性標記的內容詞\n",
    "可以依需求找出「名詞」、「動詞」或是「形容詞」…等詞彙語意本身已經完整的詞彙。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 content word.\n",
    "contentWordLIST = articut.getContentWordLIST(result)\n",
    "print(\"## ContentWord:\")\n",
    "pprint(contentWordLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的人名 (不含代名詞).\n",
    "personLIST = articut.getPersonLIST(result, includePronounBOOL=False)\n",
    "print(\"## Person (Without Pronoun):\")\n",
    "pprint(personLIST)\n",
    "\n",
    "personLIST = articut.getPersonLIST(result, includePronounBOOL=True)\n",
    "print(\"\\n## Person (With Pronoun):\")\n",
    "pprint(personLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 verb word. (動詞)\n",
    "verbStemLIST = articut.getVerbStemLIST(result)\n",
    "print(\"## Verb:\")\n",
    "pprint(verbStemLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 noun word. (名詞)\n",
    "nounStemLIST = articut.getNounStemLIST(result)\n",
    "print(\"## Noun:\")\n",
    "pprint(nounStemLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 time (時間)\n",
    "inputSTR = \"火箭在台灣時間25日下午2時30分順利發射\"\n",
    "result = articut.parse(inputSTR)\n",
    "timeLIST = articut.getTimeLIST(result)\n",
    "print(\"## Time:\")\n",
    "pprint(timeLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 location word. (地方名稱)\n",
    "inputSTR = \"台北信義區百貨商圈附近有許多美食餐廳。\"\n",
    "result = articut.parse(inputSTR)\n",
    "locationStemLIST = articut.getLocationStemLIST(result)\n",
    "print(\"## Location:\")\n",
    "pprint(locationStemLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的 CLAUSE 問句\n",
    "inputSTR = \"請問目前 Intel 最貴的 CPU 售價多少?\"\n",
    "result = articut.parse(inputSTR)\n",
    "questionLIST = articut.getQuestionLIST(result)\n",
    "print(\"## Question:\")\n",
    "pprint(questionLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#列出所有的貨幣金額\n",
    "inputSTR = \"Macbook Pro 新機售價 1799.77 美元\"\n",
    "result = articut.parse(inputSTR)\n",
    "currencyResult = articut.getCurrencyLIST(result)\n",
    "print(\"## currencyLIST:\")\n",
    "pprint(currencyResult)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 調用資料觀光資訊資料庫\n",
    "政府開放平台中存有「交通部觀光局蒐集各政府機關所發佈空間化觀光資訊」。  \n",
    "Articut 可取用其中的資訊，並標記為 <KNOWLEDGE\\_place>。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#允許 Articut 調用字典，列出所有政府開放資料中列為觀光地點名稱的字串。(地點名稱)\n",
    "inputSTR = \"我想去花東國家風景區賞鳥\"\n",
    "result = articut.parse(inputSTR, openDataPlaceAccessBOOL=True)\n",
    "\n",
    "placeLIST = articut.getOpenDataPlaceLIST(result)\n",
    "print(\"## Place:\")\n",
    "pprint(placeLIST)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 調用 WikiData 資料庫\n",
    "Articut 可取用其條目 (Label) 的資訊，並標記為 <KNOWLEDGE\\_wikiData>。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#允許 Articut 調用 WikiData 字典，列出所有 WikiData 條目名稱的字串。\n",
    "inputSTR = \"我在瓶子里發現兩隻邊緣真亮羽水蚤。\"\n",
    "result = articut.parse(inputSTR, wikiDataBOOL=True)\n",
    "\n",
    "wikiDataLIST = articut.getWikiDataLIST(result)\n",
    "print(\"## WikiData:\")\n",
    "pprint(wikiDataLIST)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Local RE (台灣地址)\n",
    "-----------------\n",
    "地址，是一串的字符，內含國家、省份、城市或鄉村、街道、門牌號碼、屋邨、大廈等建築物名稱，或者再加樓層數目、房間編號等。  \n",
    "一個有效的地址應該是**獨一無二**的，才能讓郵差等物流從業員派送郵件，或者上門收件。 \n",
    "\n",
    "羅斯福路四段，指的是地理上的`區域`，而不是一個`地址`，因為光是看到「羅斯福路四段」並無法得知正確的所在位置。\n",
    "```\n",
    "X 台北市大安區羅斯福路四段\n",
    "```\n",
    "\n",
    "羅斯福路四段1號，指的是地理上的`絕對位置`，並可正確的知道所在位置。\n",
    "```\n",
    "O 台北市大安區羅斯福路四段1號\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputSTR = \"台北市大安區羅斯福路四段1號\"\n",
    "result = articut.parse(inputSTR)\n",
    "\n",
    "#列出所有的台灣地址\n",
    "addTWLIST = articut.getAddTWLIST(result)\n",
    "print(\"## Address:\")\n",
    "pprint(addTWLIST)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#使用 localRE 工具取得地址分段細節\n",
    "countyResult = articut.localRE.getAddressCounty(result)\n",
    "print(\"## localRE: 縣\")\n",
    "pprint(countyResult)\n",
    "\n",
    "cityResult = articut.localRE.getAddressCity(result)\n",
    "print(\"\\n## localRE: 市\")\n",
    "pprint(cityResult)\n",
    "\n",
    "districtResult = articut.localRE.getAddressDistrict(result)\n",
    "print(\"\\n## localRE: 區\")\n",
    "pprint(districtResult)\n",
    "\n",
    "townshipResult = articut.localRE.getAddressTownship(result)\n",
    "print(\"\\n## localRE: 鄉里\")\n",
    "pprint(townshipResult)\n",
    "\n",
    "townResult = articut.localRE.getAddressTown(result)\n",
    "print(\"\\n## localRE: 鎮\")\n",
    "pprint(townResult)\n",
    "\n",
    "villageResult = articut.localRE.getAddressVillage(result)\n",
    "print(\"\\n## localRE: 村\")\n",
    "pprint(villageResult)\n",
    "\n",
    "neighborhoodResult = articut.localRE.getAddressNeighborhood(result)\n",
    "print(\"\\n## localRE: 鄰\")\n",
    "pprint(neighborhoodResult)\n",
    "\n",
    "roadResult = articut.localRE.getAddressRoad(result)\n",
    "print(\"\\n## localRE: 路\")\n",
    "pprint(roadResult)\n",
    "\n",
    "sectionResult = articut.localRE.getAddressSection(result)\n",
    "print(\"\\n## localRE: 段\")\n",
    "pprint(sectionResult)\n",
    "\n",
    "alleyResult = articut.localRE.getAddressAlley(result)\n",
    "print(\"\\n## localRE: 巷、弄\")\n",
    "pprint(alleyResult)\n",
    "\n",
    "numberResult = articut.localRE.getAddressNumber(result)\n",
    "print(\"\\n## localRE: 號\")\n",
    "pprint(numberResult)\n",
    "\n",
    "floorResult = articut.localRE.getAddressFloor(result)\n",
    "print(\"\\n## localRE: 樓\")\n",
    "pprint(floorResult)\n",
    "\n",
    "roomResult = articut.localRE.getAddressRoom(result)\n",
    "print(\"\\n## localRE: 室\")\n",
    "pprint(roomResult)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基於 TF-IDF 算法的關鍵詞抽取\n",
    "* sentence 為要提取關鍵詞的文本\n",
    "\t* topK 為提取幾個 TF-IDF 的關鍵詞，預設值為 20\n",
    "\t* withWeight 為是否返回關鍵詞權重值，預設值為 False\n",
    "\t* allowPOS 僅抽取指定詞性的詞，預設值為空，亦即全部抽取\n",
    "* articut.analyse.TFIDF(idf_path=None) 新建 TFIDF 物件，idf_path 為 IDF 語料庫路徑"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfidfResult = articut.analyse.extract_tags(result)\n",
    "print(\"## TF-IDF:\")\n",
    "pprint(tfidfResult)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用 Textrank 演算法\n",
    "* sentence 為要提取關鍵詞的文本\n",
    "\t* topK 為提取幾個 TF-IDF 的關鍵詞，預設值為 20\n",
    "\t* withWeight 為是否返回關鍵詞權重值，預設值為 False\n",
    "\t* allowPOS 僅抽取指定詞性的詞，預設值為空，亦即全部抽取\n",
    "* articut.analyse.TextRank() 新建 TextRank 物件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "textrankResult = articut.analyse.textrank(result)\n",
    "print(\"## Textrank:\")\n",
    "pprint(textrankResult)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 列出目前可使用的 Articut 版本\n",
    "通常版本號愈大，功能越多。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "versions = articut.versions()\n",
    "if versions[\"status\"]:\n",
    "    print(\"## Avaliable Versions:\")\n",
    "    pprint(versions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用法律文件工具\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputSTR = \"\"\"被告前因非法持有槍械，業經前案判決非法持有可發射子彈具殺傷力之槍枝罪，處有期徒刑參年陸月，併科罰金新臺幣拾萬元。\n",
    "於前案偵查過程中，南投縣政府警察局集集分局之員警，持本院核發之105年度聲搜字第165號搜索票。\"\"\"\n",
    "result = articut.parse(inputSTR, level=\"lv2\")\n",
    "lawsToolkit = LawsToolkit(result)\n",
    "print(\"## Crime\")\n",
    "pprint(lawsToolkit.getCrime())\n",
    "print(\"## Criminal Responsibility\")\n",
    "pprint(lawsToolkit.getCriminalResponsibility())\n",
    "print(\"## Event Reference\")\n",
    "pprint(lawsToolkit.getEventRef())\n",
    "print(\"## Law Article\")\n",
    "pprint(lawsToolkit.getLawArticle())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
